Update create data package utility #104
Open
r-b-g-b wants to merge 6 commits into DataONEorg:master from
r-b-g-b
commented
Jan 23, 2025
Author
r-b-g-b
left a comment
Adding some inline comments to explain the changes!
| pid, SYSMETA_FORMATID, sci_obj | ||
| ) | ||
| client.create(pid, io.StringIO(sci_obj), sys_meta) | ||
| client.create(pid, io.BytesIO(sci_obj), sys_meta) |
Author
Currently, this line results in the error:
TypeError Traceback (most recent call last)
Cell In[62], line 8
6 for file_path in files_in_group:
7 print(" File: {}".format(file_path))
----> 8 create_science_object_on_member_node(client, file_path)
File ~/projects/cib-data-infrastructure/.venv/lib/python3.13/site-packages/d1_util/create_data_packages.py:178, in create_science_object_on_member_node(client, file_path)
174 sci_obj = open(file_path, "rb").read()
175 sys_meta = generate_system_metadata_for_science_object(
176 pid, SYSMETA_FORMATID, sci_obj
177 )
--> 178 client.create(pid, io.StringIO(sci_obj), sys_meta)
TypeError: initial_value must be str or None, not bytes
That makes sense: the file is opened in "rb" (binary) mode, so read() returns bytes, and those bytes need to be wrapped in BytesIO instead of StringIO (which expects a decoded str).
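For reference, the difference can be reproduced with the standard library alone. This is a minimal sketch, not the utility's code: the sci_obj value below is a made-up stand-in for the bytes read from a file, and no DataONE client is involved.

```python
import io

# Hypothetical stand-in for the bytes returned by open(file_path, "rb").read().
sci_obj = b"col_a,col_b\n1,2\n"

# io.StringIO only accepts str (or None), so passing bytes raises TypeError.
try:
    io.StringIO(sci_obj)
    string_io_error = None
except TypeError as exc:
    string_io_error = str(exc)

# io.BytesIO accepts bytes and gives a file-like object over the same payload.
stream = io.BytesIO(sci_obj)
payload = stream.read()

print("StringIO error:", string_io_error)
print("BytesIO payload length:", len(payload))
```

Wrapping the raw bytes keeps the payload byte-for-byte identical to the file on disk, which should also keep it consistent with the size and checksum computed from the same bytes.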
| # default, use the DataONE production environment for resolving the object | ||
| # URIs. To use the resource map generator in a test environment, pass the base | ||
| # url to the root CN in that environment in the dataone_root parameter. | ||
| resource_map_generator = d1_common.resource_map.ResourceMapGenerator() |
Author
As is, I get the error:
AttributeError Traceback (most recent call last)
Cell In[83], line 1
----> 1 create_package_on_member_node(client, files_in_group)
Cell In[65], line 90, in create_package_on_member_node(client, files_in_group)
88 package_pid = group_name(files_in_group[0])
89 pids = [os.path.basename(p) for p in files_in_group]
---> 90 resource_map = create_resource_map_for_pids(package_pid, pids)
91 sys_meta = generate_system_metadata_for_science_object(
92 package_pid, RESOURCE_MAP_FORMAT_ID, resource_map
93 )
94 client.create(package_pid, io.StringIO(resource_map), sys_meta)
Cell In[65], line 102, in create_resource_map_for_pids(package_pid, pids)
97 def create_resource_map_for_pids(package_pid, pids):
98 # Create a resource map generator that will generate resource maps that, by
99 # default, use the DataONE production environment for resolving the object
100 # URIs. To use the resource map generator in a test environment, pass the base
101 # url to the root CN in that environment in the dataone_root parameter.
--> 102 resource_map_generator = d1_common.resource_map.ResourceMapGenerator()
103 return resource_map_generator.simple_generate_resource_map(
104 package_pid, pids[0], pids[1:]
105 )
AttributeError: module 'd1_common.resource_map' has no attribute 'ResourceMapGenerator'
I believe this class has since been removed and its method simple_generate_resource_map replaced with d1_common.resource_map.createSimpleResourceMap.
| package_pid, pids | ||
| ).serialize_to_transport() | ||
| sys_meta = generate_system_metadata_for_science_object( | ||
| package_pid, RESOURCE_MAP_FORMAT_ID, resource_map |
Author
If we don't serialize the resource map, we get an error here:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[97], line 1
----> 1 create_package_on_member_node(client, files_in_group)
Cell In[96], line 91, in create_package_on_member_node(client, files_in_group)
89 pids = [os.path.basename(p) for p in files_in_group]
90 resource_map = create_resource_map_for_pids(package_pid, pids)
---> 91 sys_meta = generate_system_metadata_for_science_object(
92 package_pid, RESOURCE_MAP_FORMAT_ID, resource_map
93 )
94 client.create(package_pid, io.BytesIO(resource_map), sys_meta)
Cell In[96], line 111, in generate_system_metadata_for_science_object(pid, format_id, science_object)
109 def generate_system_metadata_for_science_object(pid, format_id, science_object):
110 size = len(science_object)
--> 111 md5 = hashlib.md5(science_object).hexdigest()
112 now = d1_common.date_time.utc_now()
113 sys_meta = generate_sys_meta(pid, format_id, size, md5, now)
TypeError: object supporting the buffer API required
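That happens because science_object is expected to be a sequence of bytes: it is passed straight to hashlib.md5, which only accepts bytes-like input, so the resource map has to be serialized first.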
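The same failure mode can be shown without d1_common. In this rough sketch, size_and_md5 is a hypothetical helper that mirrors the two lines from the traceback, NotSerialized is a made-up stand-in for an unserialized resource map object, and the serialized value is placeholder bytes rather than real resource map output.

```python
import hashlib


def size_and_md5(science_object: bytes) -> tuple[int, str]:
    """Hypothetical helper: byte length and MD5 hex digest of an in-memory object."""
    return len(science_object), hashlib.md5(science_object).hexdigest()


class NotSerialized:
    """Made-up stand-in for a resource map object that has not been serialized."""


# hashlib.md5 rejects objects that do not support the buffer protocol.
try:
    hashlib.md5(NotSerialized())
    unserialized_error = None
except TypeError as exc:
    unserialized_error = str(exc)

# Placeholder bytes standing in for a serialized resource map.
serialized = b"<rdf:RDF/>"
size, md5 = size_and_md5(serialized)

print("Unserialized error:", unserialized_error)
print("size:", size, "md5:", md5)
```

Serializing once and reusing the same bytes for the size, the checksum and the upload stream avoids any mismatch between the system metadata and the object that is actually sent.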
datadavev
approved these changes
Jan 24, 2025
Two changes to d1_util.create_data_packages to let it run correctly. I added some more detail inline in the "Files changed" tab.